Tidymodels workflow
For linear regression, the number of predictors must not exceed the number of observations.
There must be no missing data in the outcome variable (any missing values should be imputed or the rows removed). Data must be centered and scaled, and correlated predictors should be removed, either by subset selection, by shrinkage methods (lasso or ridge regression), or by dimensionality reduction (PCA or PLS).
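In tidymodels these requirements map onto recipe steps. A minimal illustrative sketch (using the ir_df data assembled below; the specific steps and thresholds here are not the recipe actually used in this post):
# Illustrative preprocessing recipe: center and scale predictors,
# drop highly correlated ones, then optionally reduce dimensions with PCA
example_recipe <- recipe(fat ~ ., data = ir_df) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_corr(all_predictors(), threshold = 0.9) %>%
  step_pca(all_predictors(), num_comp = 10)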
The exercise below uses a dataset from the Applied Predictive Modeling book; the workflow follows the one outlined in this YouTube video, and I also took reference from Julia Silge's blog post on lasso regression for The Office.
"These data are recorded on a Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission (NIT) principle. Each sample contains finely chopped pure meat with different moisture, fat and protein contents.
“For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.”
data(tecator) # loads the absorp and endpoints matrices from the caret package
absorbance_df <- as_tibble(absorp) %>%
  clean_names() # janitor: renames V1-V100 to v1-v100
# glimpse(absorbance_df) # absorbance
endpoints_df <- as_tibble(endpoints) %>%
rename("water" = V1,
"fat" = V2,
"protein" = V3)
glimpse(endpoints_df)
Rows: 215
Columns: 3
$ water <dbl> 60.50, 46.00, 71.00, 72.80, 58.30, 44.00, 44.00, 69.…
$ fat <dbl> 22.5, 40.1, 8.4, 5.9, 25.5, 42.7, 42.7, 10.6, 19.9, …
$ protein <dbl> 16.7, 13.5, 20.5, 20.7, 15.5, 13.7, 13.7, 19.3, 17.7…
ir_df <- bind_cols(endpoints_df, absorbance_df) %>%
select(-water, -protein) # to predict fat content from IR spectrum
# glimpse(ir_df)
ir_df %>%
ggplot(aes(fat)) +
geom_freqpoly() +
theme_classic()
summary(ir_df$fat) # fat (Y) is right skewed (mean > median)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.90 7.30 14.00 18.14 28.00 49.10
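The recipe below references ir_training; the split and resampling objects (ir_split, ir_training, ir_testing, ir_cv) are defined earlier in the post. A minimal sketch of how they could be set up (the seed, 75/25 proportion and fold count are assumptions, consistent with the 54 test rows and n = 10 seen in the outputs below):
set.seed(2021) # hypothetical seed
ir_split <- initial_split(ir_df, prop = 0.75) # ~161 training / 54 test rows
ir_training <- training(ir_split)
ir_testing <- testing(ir_split)
ir_cv <- vfold_cv(ir_training, v = 10) # 10-fold cross-validation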
lm_recipe <-
  recipe(fat ~ ., data = ir_training) %>%
  # step_sqrt(fat) %>%
  # step_center(all_predictors()) # centering is recommended for spectral data to minimise noise
  step_normalize(everything()) # note: this centers and scales the outcome (fat) as well
lm_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 100
Operations:
Centering and scaling for everything()
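To see what the recipe actually produces, it can be prepped and baked (a sketch; bake(new_data = NULL) returns the preprocessed training set):
lm_recipe %>%
  prep() %>%
  bake(new_data = NULL) # all columns centered and scaled, including fat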
# Specify a linear model, tuning penalty and mixture to decide between OLS, ridge, lasso and elastic net
lm_model <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_engine("glmnet")
lm_model
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
# for penalty (lambda) = 0, the model is the same as OLS
# as penalty grows large, coefficients shrink towards zero, approaching a null model without predictors
# for ridge regression, mixture (alpha) = 0
# for lasso regression, mixture (alpha) = 1
grid <- expand_grid(
penalty = seq(0, 100, by = 10),
mixture = seq(0, 1, by = 0.1)
)
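For reference, glmnet minimises the elastic-net objective below, where penalty is $\lambda$ and mixture is $\alpha$ (so mixture = 0 is pure ridge and mixture = 1 is pure lasso):
$$
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2
+ \lambda\Bigl[\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\Bigr]
$$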
results <- tune_grid(
lm_model,
preprocessor = lm_recipe,
grid = grid,
resamples = ir_cv
)
results %>%
show_best(metric = "rmse")
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0 0.5 rmse standard 0.241 10 0.0145 Preprocessor…
2 0 0.4 rmse standard 0.241 10 0.0141 Preprocessor…
3 0 0.2 rmse standard 0.243 10 0.0139 Preprocessor…
4 0 0.7 rmse standard 0.247 10 0.0140 Preprocessor…
5 0 0.1 rmse standard 0.247 10 0.0132 Preprocessor…
results %>%
show_best(metric = "rsq")
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0 0.5 rsq standard 0.950 10 0.00560 Preprocessor…
2 0 0.4 rsq standard 0.950 10 0.00576 Preprocessor…
3 0 0.2 rsq standard 0.949 10 0.00590 Preprocessor…
4 0 0.7 rsq standard 0.948 10 0.00577 Preprocessor…
5 0 0.1 rsq standard 0.948 10 0.00605 Preprocessor…
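Beyond show_best(), the full tuning grid can be inspected or visualised (a sketch):
results %>%
  collect_metrics() # one row per penalty/mixture/metric combination
autoplot(results) # performance across the tuning grid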
lm_best_param <- results %>%
select_best(metric = "rmse")
lm_best_param
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0 0.5 Preprocessor1_Model056
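Equivalently, the selected parameters can be spliced in with tune's finalize helpers instead of retyping them (a sketch):
lm_best_model <- finalize_model(lm_model, lm_best_param)
# finalize_workflow() does the same for a complete workflow object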
lm_best_model <-
linear_reg(penalty = 0, mixture = 0.5) %>%
set_engine("glmnet")
lm_final_workflow <-
workflow() %>%
add_model(lm_best_model) %>%
add_recipe(lm_recipe)
lm_final_workflow
══ Workflow ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ──────────────────────────────────────────────────────
1 Recipe Step
• step_normalize()
── Model ─────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0
mixture = 0.5
Computational engine: glmnet
# fit the final workflow to the training dataset and plot variable importance
lm_final_workflow %>%
fit(ir_training) %>%
extract_fit_parsnip() %>%
vip::vip(num_features = 20) +
theme_classic()
# Map the top channels from the importance plot back to their wavelengths
seq(850,1050,2) %>%
as_tibble() %>%
rowid_to_column() %>%
filter(rowid %in% c(1:3, 39:42))
# A tibble: 7 × 2
rowid value
<int> <dbl>
1 1 850
2 2 852
3 3 854
4 39 926
5 40 928
6 41 930
7 42 932
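A lookup table joining channel names to wavelengths makes this mapping reusable (a sketch; the exact 2 nm spacing is an assumption based on the 850 - 1050 nm range and 100 channels):
wavelength_lookup <- tibble(
  term = paste0("v", 1:100), # channel names as used in the model
  wavelength_nm = seq(850, by = 2, length.out = 100)
)
# e.g. left_join(tidy(lm_fit), wavelength_lookup, by = "term")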
# fit the tuned model directly on the full dataset (without the recipe) to inspect coefficients
lm_fit <- lm_best_model %>%
  fit(fat ~ ., data = ir_df)
tidy(lm_fit)
# A tibble: 101 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 16.8 0
2 v1 47.1 0
3 v2 23.5 0
4 v3 8.33 0
5 v4 9.26 0
6 v5 0.175 0
7 v6 0 0
8 v7 0 0
9 v8 0 0
10 v9 0 0
# … with 91 more rows
# last fit and evaluate on test dataset
final_fit <-
lm_final_workflow %>%
last_fit(ir_split)
final_fit %>%
collect_metrics() # on test dataset: rmse 0.246, rsq 0.934 (fat was normalised by the recipe, so RMSE is in standard-deviation units rather than fat %)
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.246 Preprocessor1_Model1
2 rsq standard 0.934 Preprocessor1_Model1
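last_fit() also stores the test-set predictions made through the full workflow, which can be pulled directly (a sketch):
final_fit %>%
  collect_predictions() # .pred and observed fat for each test row (normalised scale)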
# predicted fat values
predict(lm_fit, ir_testing)
# A tibble: 54 × 1
.pred
<dbl>
1 19.6
2 10.6
3 24.0
4 15.3
5 4.44
6 9.47
7 9.13
8 7.59
9 9.94
10 17.4
# … with 44 more rows
ir_aug <-
augment(lm_fit, ir_testing)
# glimpse(ir_aug)
# visualizing predicted fat vs actual fat content for test dataset
ir_aug %>%
ggplot(aes(fat, .pred)) +
geom_point(col = "darkblue", size = 3) +
geom_abline(lty = 2, col = "grey50", size = 2) +
scale_x_continuous(expand = c(0,0), limits = c(0, 60)) +
scale_y_continuous(expand = c(0,0), limits = c(0, 60)) +
labs(title = "Predicted vs Actual Fat Content for Test Dataset",
x = "Actual Fat %",
y = "Predicted Fat %") +
theme_classic() +
coord_equal()
# accuracy is slightly off for higher fat % samples - tendency to over-estimate.
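To put numbers to this plot, yardstick metrics can be computed on the augmented test set (a sketch; note lm_fit was fit on the full dataset including the test rows, so these values will be optimistic compared with last_fit()):
ir_aug %>%
  metrics(truth = fat, estimate = .pred) # rmse, rsq and mae on the raw fat % scale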
For PLS, I attempted it earlier here. The workflow is the same, except that the preprocessing step did not normalise everything.